{lib}[foss/2024a] TensorFlow v2.18.1 w/ CUDA 12.6.0#22921
boegel merged 10 commits into easybuilders:develop
Updated software
@boegelbot please test @ jsc-zen3-a100 EB_ARGS="--include-easyblocks-from-pr 3699" |
|
@pavelToman: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de PR test command '
Test results coming soon (I hope)... Details- notification for comment with ID 2895362254 processed Message to humans: this is just bookkeeping information for me, |
|
Test report by @boegelbot |
Failing without patches from #22848 |
|
Tests are failing on Joltik with 16 CPUs + 36 GB RAM + eb --parallel 10 |
|
Tests also failing on accelgor: |
|
It also fails on Donphan now (https://gist.github.com/pavelToman/1b422fc88ec080014ec4cc0edfd42548). I can see that CUDA/12.6.0/nvvm/libdevice/libdevice.10.bc exists. Should I add conda-forge/tensorflow-feedstock#420? But I still do not understand why it worked before on Donphan without XLA_FLAGS set... |
|
This is likely caused by TF not using the right flags during build. With some source grepping I found I didn't get much further but I found https://openxla.org/xla/hermetic_cuda?hl=de#pointing_to_cudacudnnnccl_redistributions_on_local_file_system
And in the TF sources I found:
Hence this needs an update of the easyblock. Spack also uses TF_CUDA_PATHS (NCCL, TensorRT, CUDA, cuDNN prefixes); not sure if that is required. I'd guess the above are enough. |
|
I added the env vars LOCAL_CUDA_PATH, LOCAL_CUDNN_PATH, LOCAL_NCCL_PATH to the configure step and also all the --repo_env=LOCAL... options to the build step, but I am hitting an error about cupti.h, even though CUDA/12.6.0/extras/CUPTI/include/cupti.h exists. |
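For readers following along, here is a minimal sketch of how the `--repo_env=LOCAL...` options mentioned above could be assembled. It assumes the CUDA, cuDNN and NCCL modules export the usual EasyBuild root variables (`EBROOTCUDA`, `EBROOTCUDNN`, `EBROOTNCCL`); the helper name `local_cuda_repo_env_flags` is hypothetical and is not taken from the TensorFlow easyblock. Whether these flags alone are sufficient is exactly what this thread is still trying to find out.

```python
import os


def local_cuda_repo_env_flags(environ=None):
    """Build Bazel --repo_env flags pointing at preinstalled CUDA, cuDNN and NCCL."""
    environ = os.environ if environ is None else environ
    # Assumed mapping: variable named in this thread -> EasyBuild root variable.
    mapping = {
        "LOCAL_CUDA_PATH": "EBROOTCUDA",
        "LOCAL_CUDNN_PATH": "EBROOTCUDNN",
        "LOCAL_NCCL_PATH": "EBROOTNCCL",
    }
    flags = []
    for bazel_var, eb_var in mapping.items():
        root = environ.get(eb_var)
        if root:  # skip dependencies whose module is not loaded
            flags.append(f"--repo_env={bazel_var}={root}")
    return flags


if __name__ == "__main__":
    # Print the flags for the currently loaded modules, one per line.
    print("\n".join(local_cuda_repo_env_flags()))
```

Dependencies that are not loaded are simply skipped, so the result can be appended to the Bazel build options without producing empty paths.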
|
You might need to ask in the TF repo. An idea: I found a search for the header at https://github.com/tensorflow/tensorflow/blob/cdced3fe1d6378853f57bcd09e9a8472639a264f/third_party/gpus/find_cuda_config.py#L327 It uses Interpreting this roughly seems to suggest it will create symlinks. I assume you only need those repo-env params, not env variables. |
|
From the log I can see this subcommand: The CUPTI include directory should be in CPATH but it is not. Maybe this is the problem? |
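A quick way to verify the CPATH observation is sketched below. It only assumes the `extras/CUPTI/include` layout quoted earlier in this thread; the function name `cupti_include_in_cpath` is hypothetical.

```python
import os


def cupti_include_in_cpath(cuda_root, cpath):
    """Return True if <cuda_root>/extras/CUPTI/include is listed in the CPATH string."""
    wanted = os.path.normpath(os.path.join(cuda_root, "extras", "CUPTI", "include"))
    # Normalise every entry so trailing slashes do not cause false negatives.
    entries = [os.path.normpath(p) for p in cpath.split(os.pathsep) if p]
    return wanted in entries


if __name__ == "__main__":
    root = os.environ.get("EBROOTCUDA", "")  # assumed to be set by the CUDA module
    print(cupti_include_in_cpath(root, os.environ.get("CPATH", "")))
```

Note that Bazel may not pass CPATH through to its sandboxed actions, so a positive result in the shell does not guarantee that the compiler invoked by the build sees the same value.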
|
The error says it can't find The path is something the build creates. So it either needs a bundled CUDA extracted there (which we don't want) or it should create symlinks at this place which is what I suspect it should do. As Spack sets Looking at their config file we might need:
But it seems to not have been updated for a while and checking the configure.py from TF it is:
So I'd set those env vars and pass |
|
I can see we have in tensorflow.py: |
|
Still the same error with cupti.h, although I added all the vars both to the configure step and via --repo_env to the build step. |
|
Maybe I found something: jax-ml/jax#23689 (comment) |
|
I've had a lively discussion with the XLA devs about that and the result is basically: don't use a preinstalled CUDA. I see 4 options:
|
|
If the issue is only caused by the location of a single file like We can symlink those files to their alternative locations in our installations of CUDA. That will not affect any well-behaved software and should fix the issue with TF. |
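The symlink idea can be sketched as follows. Which files and which alternative locations are meant is not spelled out in this thread, so the `cupti.h` paths in the usage example are illustrative assumptions only, and the helper `symlink_to_alternative` is hypothetical rather than what the easyblock actually does.

```python
import os


def symlink_to_alternative(existing, alternative):
    """Create a relative symlink at `alternative` pointing to `existing`.

    Returns True if a link was created, False if `alternative` already exists.
    """
    if not os.path.exists(existing):
        raise FileNotFoundError(existing)
    if os.path.lexists(alternative):
        return False  # never overwrite files shipped by CUDA itself
    link_dir = os.path.dirname(alternative)
    os.makedirs(link_dir, exist_ok=True)
    # A relative target keeps the link valid if the installation is relocated.
    os.symlink(os.path.relpath(existing, link_dir), alternative)
    return True


if __name__ == "__main__":
    root = os.environ.get("EBROOTCUDA", "")  # assumed to be set by the CUDA module
    if root:
        # Illustrative source and destination, not confirmed by the thread.
        src = os.path.join(root, "extras", "CUPTI", "include", "cupti.h")
        dst = os.path.join(root, "include", "cupti.h")
        print(symlink_to_alternative(src, dst))
```

Refusing to replace an existing path is what keeps this safe for well-behaved software: only locations that are otherwise empty are touched.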
Still not too bad though |
|
@pavelToman @Flamefire I made a PR for option 3 as discussed: easybuilders/easybuild-easyblocks#3791 |
|
So that might now work with easybuilders/easybuild-easyblocks#3765. I'll test that. |
|
Test report by @Flamefire |
|
Looks good now. @pavelToman Can you update this branch to include the missing files please? (rebase on or merge current develop) |
Thank you! I think the error on JSC was there all the time, but before it was hidden by 'excessive module-command related logs', as mentioned before, or by the GPU not working in the boegelbot tests.
Adding to that: This doesn't influence the relevant part of the compilation. What is actually the intention of this (besides just trying to work around the current issue)? Maybe we need another change in the easyblock for that. |
This piece of code was added 7 years ago in easybuilders/easybuild-easyblocks#1436 by Damian Alvarez |
|
Test report by @Flamefire |
|
Test report by @pavelToman
|
|
Test report by @boegelbot |
|
Test report by @Flamefire |
|
Test report by @pavelToman |
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
@Thyre Do you have any idea why rebuilding |
The module itself looked fine, though I haven't checked the contents of all files. Maybe because the module was now in We can try triggering another build without building in |
|
Test report by @pavelToman Another weird error on litleo: |
|
Test report by @Flamefire Suspected system issue:
|
|
Test report by @Flamefire |
|
Test report by @Flamefire |
|
Test report by @pavelToman Again an error: We have a lot of space left on /scratch, but we hit the inode limit: |
TF/Bazel takes an unholy amount of space, multiple GB |
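Since free space alone does not reveal this kind of failure, a small check of inode usage can help before starting a build. This is a generic sketch using `os.statvfs` (Unix only); the `/scratch` default is taken from the comment above, and the counts reported are filesystem-wide, so a per-user or per-project quota would need the site's own quota tooling instead.

```python
import os
import sys


def inode_usage(path):
    """Return (used, total) inode counts for the filesystem containing `path`."""
    st = os.statvfs(path)
    total = st.f_files           # total number of inodes on the filesystem
    used = total - st.f_ffree    # f_ffree is the number of free inodes
    return used, total


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/scratch"
    used, total = inode_usage(target)
    print(f"{target}: {used} of {total} inodes in use")
```

Some filesystems report zero for both values, in which case the result carries no information.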
|
Test report by @pavelToman |
|
This seems good to me. What do you think, @Flamefire, is it ready to merge? It succeeds on our GPU nodes litleo (H100) and accelgor (A100), and also on jsc-zen3 (A100). |
|
No objections, tests on our machines look good too |
|
Test report by @pavelToman |
|
Test report by @boegel |
|
Test report by @boegel |
|
Going in, thanks @pavelToman! |
|
Thanks go mainly to @Flamefire |
(created using eb --new-pr)
requires:
update easyblock:
patches from: